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Abstract 

Recently, a wealth of epigenomic data has been generated by biochemical assays and 
next-generation sequencing (NGS) technologies. In particular, histone modification data 
generated by the ENCODE project and other large-scale projects show specific patterns 
associated with regulatory elements in the human genome. It is important to build a 
unified statistical model to decipher the patterns of multiple histone modifications in a 
cell type to annotate chromatin states such as transcription start sites, enhancers and 
transcribed regions rather than to map histone modifications individually to regulatory 
elements. 

Several genome-wide statistical models have been developed based on hidden Markov 
models (HMMs) . These methods typically use the Expectation-Maximization (EM) algo- 
rithm to estimate the parameters of the model. Here we used spectral learning, a state- 
of-the-art parameter estimation algorithm in machine learning. We found that spectral 
learning plus a few (up to five) iterations of local optimization of the likelihood outper- 
forms the standard EM algorithm. We also evaluated our software implementation called 
Spectacle on independent biological datasets and found that Spectacle annotated ex- 
perimentally defined functional elements such as enhancers significantly better than a 
previous state-of-the-art method. 

Spectacle can be downloaded from https://github.com/jiminsong/Spectacle . 



1 Introduction 



Identifying regulatory elements in the genome is a challenging problem with applications in 
understanding many aspects of biology, including the molecular mechanisms of disease. In 
particular, cell- type specific gene regulation cannot be explained solely by genome sequence 
data because the genome is essentially the same in most cell-types. Recently, the ENCODE 
project (ENCODE Project Consortium 2012) produced a wealth of epigenetic data using 
biochemical assays and next-generation sequencing (NGS) technologies. This data has the 
potential to significantly improve our understanding of cell-type specific gene regulation. In 
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addition to ENCODE, the International Human Epigenome Consortium (IHEC, http://ihec- 
epigenomes.org/) also aims to produce reference maps of 1000 human epigenomes. It includes 



several major projects such as BLUEPRINT and the Roadmap Epigenomics Project (Adams 



et al. 2012 Bernstein et al. 2010). Here we refer to epigenetic data as histone modifica- 



tion, DNA methylation, transcription factor binding, open chromatin and other related types 
of data generated by biochemical assays. In particular, histone modification refers to post- 
translational modification of histones where chemical groups such as methyl, acetyl, phos- 



phate, ubiquitin, SUMO, ADP-ribose etc are added or deleted to histone tails (Portela and 
Esteller 2010). The "epigenome" provides another dimension to our understanding of the 



human genome and it can be used to identify cell-type specific gene regulatory elements or 
disease variants (Rivera and Ren 2013). Genomic features delineated by epigenomic marks 



have been linked to genome-wide association study (GWAS) loci ( Ernst et aZ.||2011 Maurano 



et al. 2012). Since up to 93% of GWAS SNPs may be located in non-coding regions, such 



results give hope that one might be able to fine-map the causal disease variants. 

Several systematic methods have been developed to identify regulatory elements or anno- 
tate chromatin states from epigenetic data (mainly, histone modification data) not only in the 
human genome but also in the Drosophila, mouse and yeast genomes ( |Ernst and Kellis||2010[ 
Filion et al.f20W\ |Kharchenko et a£.||20TT| |Hoffman et aT1|20T2l |Shen et al]f2012\ |Wang et al. 



20T21 |Biesinger et aL||20T3l |Hoftman et~al\^m3\ |Lai and Buckl[2013l |Won et aZ.|[2Q13t [Zemg 



et al.\2013 ). Among these methods, Hidden Markov Models (HMMs) were repeatedly used as 



the underlying probabilistic model. HMMs (Hidden Markov Models) are widely used in many 
different fields including natural language processing and computational biology. In order to 
annotate chromatin states, we can design an HMM composed of hidden states which represent 
chromatin states such as transcription start sites (TSSs), promoters, enhancers, transcribed 
regions, non-coding RNA regions etc., and emission states which represent combinations of 
histone marks. 

To estimate the parameters of an HMM in an unsupervised way (i.e. without access to 
labelled examples), the EM (Expectation- Maximization) algorithm has been the standard 
algorithm for a long time ( jDempster et al. 1977 Rabiner 1989). The EM algorithm is a 
likelihood approach that iteratively converges to a local optimum in the likelihood of the 
data. It is iterative so it is often slow and it is somewhat arbitrary to decide when to stop 
the iterations. Since EM is not guaranteed to find a global optimum, effective parameter 
initialization is important to achieve good practical performance (Huang et al. 2001). 

Here we employed spectral learning, a state-of-the-art algorithm in machine learning to 
estimate the parameters of an HMM, which was recently developed by ( Hsu et aT]|2012 ). It is 



a Method of Moments approach which predates the Maximum Likelihood approach (Pearson 



1895) and is different by nature from the EM algorithm. Instead of attempting to find the 
maximum likelihood solution, it expresses various unobservable moments of the model as 
functions of the parameters, sets these moments equal to the sample moments estimated from 
the data and solves these equations to estimate the parameters. 

One advantage of spectral learning is that it is provably correct under mild assumptions, 
specifically that the transition and emission matrices of the HMM are full rank and that the 
initial state vector is strictly greater than 0 in all coordinates. It has theoretically provable 
polynomial sample complexity bounds, which gives us a way to analyze the minimum sample 
size needed to infer the HMM parameters correctly. It does not suffer from local optima issues 
so it does not depend on the parameter initialization. 
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For practical applications, the spectral algorithm can be much faster than the EM algo- 
rithm. The main computation in spectral learning is to compute the SVD (singular value 
decomposition) of a matrix whose dimension depends on the number of histone marks. Thus 
the time complexity of the computation is not dominated by the genome size or the number 
of hidden states. 

Our main technical contribution is to develop a practical implementation of spectral learn- 
ing for HMMs for annotating chromatin domains that does not suffer from numerical insta- 
bility issues. The original method of (Hsu et al. 2012) learned the HMM parameters only 
up to an unknown invertible transformation. The transformed parameters are still useful for 
computing the likelihood of the data and predicting future observations but they cannot be 
used to annotate the hidden chromatin states. Mossel and Roch showed how to infer the 
HMM parameters explicitly for the special case where the state and observation spaces have 
the same dimension ( |Mossel and Roch 2006) and (Hsu et al. 2012) extended their method 
to more general HMMs. However, the published method can be unstable if the eigenvalues 
corresponding to the chromatin states are not well separable and the method relies on inject- 
ing noise into the eigenvector computation to spread the eigenvalues (Hsu et al. 2012[ ). We 
have improved the method in a deterministic and principled way to infer the (untransformed) 
HMM parameters by using the empirical observation that for chromatin data, the state space 
is much smaller than the observation space and there is a major observation that we call an 
"anchor observation" for each state. 

In addition, the method of ( Hsu et aT||2012 ) assumed that we have access to many short, 
independent samples from the HMM. We have modified their method to our application 
where we have a few long samples from the HMM by rewriting the formula for the transition 
matrix parameters in a form that does not depend on the initial state probability distribution. 
Finally, we have contributed empirically developed solutions for several technical issues for our 
application of chromatin state annotation in order to deal with noise in the data, overfitting 
and negative probability values. 

We ran our method called Spectacle (SPECTral learning for Annotating Chromatin La- 
bels and Epigenomics) for different numbers of states and multiple cell types. In most cases, 
Spectacle found a higher likelihood of the data than ChromHMM (Ernst and Kellis 2012) 
which has the same underlying statistical model but uses the EM algorithm for inference. 
Moreover, Spectacle was consistently much faster than ChromHMM for training parame- 
ters. Since ENCODE generated data for 147 cell types and much more epigenomic map data 
is expected to be produced in the future, we believe that Spectacle will scale up much bet- 
ter to tackling such data sets. In addition we picked one first tier cell type from ENCODE, 
GM12878, to perform external validations. We utilized independent biological datasets such 
as TSS, transcribed regions, lincRNAs (long intergenic non-coding RNAs) and enhancers to 
evaluate our method. Overall, our method uncovered patterns of histone modification marks 
associated with biological datasets significantly better than ChromHMM. Finally, Specta- 
cle was implemented in Java by modifying the ChromHMM code. Spectacle should be 
easily portable to most desktops and the source code is freely available online. 
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2 Methods 



2.1 Related work 

To identify regulatory elements in the human genome from histone modification data, there are 
two broad classes of approaches- supervised and unsupervised learning. Supervised learning 
associates patterns of histone modifications with known chromatin states such as enhancers, 
promoters and non-coding RNAs. It can have good performance for known chromatin states 
but it requires the availability of known examples and it cannot discover new types of chro- 
matin states flGuttman et a£.||2009[ |Heintzman et al.||2009| |Filion et a/.1|2010| |Yip et a/.|[20T2 



Won et al. 2013). Unsupervised learning discovers patterns of histone modifications for each 



Ucar et al. 2011 Ernst and Kellis 



2012 



ChromHMM (Ernst and Kellis 



chromatin state de novo (Lian et al. 2008 Jaschek and Tanay 2009 Ernst and Kellis 2010 



Hoffman et al. 2012) 



2010 



2012) uses a multivariate Hidden Markov Model 



(HMM) to predict chromatin states from histone mark data. Each chromosome is segmented 
into non-overlapping regions of 200bp (base pair) and each segment has a binary value rep- 
resenting the presence/absence of each histone mark. Given a fixed number of hidden states, 
each segment emits a specific combination of histone marks in a hidden state. To perform 



unsupervised learning of the HMM parameters, they used the standard EM algorithm (Demp- 



ster et aT||1977 ), also called the Baum- Welch algorithm in the context of HMMs. For the EM 
algorithm to avoid local optima, it is important to initialize the parameters well. Instead 
of initializing the parameters at random, they proposed a heuristic method called "infor- 



mation" (Ernst and Kellis 2012) which we briefly describe in the Supplementary Material. 



ChromHMM ran the EM algorithm to convergence, that is, the difference between the likeli- 
hood of the current iteration and that of the previous iteration was less than 0.001. Regardless 
of convergence, the maximum number of iterations was set to 200. 

Hoffman et al. 2012 used a generalization of HMMs called "dynamic Bayesian networks" 
(DBN) to model chromatin states. For example, their program, Segway, learns how long each 
state lasts by adding another hidden state called a countdown variable. Segway has high- 
resolution since it uses a segment size of lbp. However, it is much slower than ChromHMM 
because of the high-resolution and the complexity of the model. Nevertheless, the performance 



of Segway on biological data sets seems to be similar to ChromHMM (Hoffman et al. 2013). 



Since the state space is much bigger than ChromHMM, it might be harder to find the global 
optimum of the likelihood. Currently, Segway cannot be run on a desktop but needs to be 
run on a compute cluster. Thus we did not use it for comparison in this work. 

In addition, there are a few other methods that are more focused on different aspects 



of the problem such as enabling joint analysis of multiple datasets (Zeng et al. 2013) and 
incorporating the lineage information between cell types ( Biesinger et q/.|2013 ). As described 
in the Introduction section, a number of methods used HMMs to model the chromatin states. 
These methods differ mainly in the way they model the histone mark data. For instance, 
some methods discretize the data while others fit Gaussian distributions to the data. In this 
work we do not explore different methods of modeling the raw histone mark data but rather 
focus on the parameter estimation technique. 
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2.2 Description of the Hidden Markov Model (HMM) 

We use HMMs to represent the chromatin states as hidden states and all possible combina- 
tions of (binarized) histone marks as observations. The whole genome is divided into segments 
of size 200bp following ( Ernst and Kellis||2010 2012). We define the HMM in matrix form as 
follows. 



Let K be the number of hidden chromatin states and N be the number of possible com- 
binations of histone marks (i.e., N = 2 M where M = number of histone marks). Let A be 
the state transition matrix where Aij is the probability of transition from state j to state i 
for 1 < i, j < K . Let O be the emission matrix where Ojj is the probability of observing the 
i-th combination of the histone marks in state j where 1 < i < N and 1 < j < K. Let ir be 
the initial state distribution vector where 7Tj is the probability of state i in the first segment 
of each chromosome when 1 < i < K . For simplicity of the method description, the whole 
genome is considered as one chromosome by concatenating all chromosomes. Let T be the 
number of segments in the genome. Let xt be the observation at the t-th segment and let 
xt r .t 2 represent xt t , x^+i, xt 2 for t\ < t2- Given parameters, 9 = (A,0,tt), the likelihood 
of an observed sequence is as follows. 



P( Xl:T \9)= Yl P(*l:T,hl:T\e) 
hi,h 3 ,...,h T 

= l^AO XT ...AO X2 AO Xl ir 



tf<B XT ...B X2 B Xl ir 



where ht in the first equation is the hidden state at the t-th segment and the summation 



is taken over all possible sequences of hidden states. Bi is an observable operator (Jaeger 



2000) and defined as Bi = AOi where Oi = diag{Oi^, 0^2, Oi^k) for 1 < i < N, and 1^ 



is 



a vector [1, 1, 1] of size K. 



2.2.1 Inference of the HMM parameters 

The method we used was adapted from a method of Mossel and Roch for inferring general phy- 



logenetic tree models (Mossel and Roch 2006). 
T 



For the observed triple data, (x t , xt+i, xt+2) 



1 < t < T — 2, the marginal probabilities of observing the counts of singletons, pairs and 
triples in the data (the moments in the name "Method of Moments") are defined as follows: 



{Pi}i = Pr[x t =i],l<i<N 
{P2,i}i,j = Pr[x t +i = i,x t = < i,j < N 
{P3,x,i}i,j = Pr[x t +2 = i,x t +i = x,x t = j], 1 < i,x, j < N 

{^3,l} i,j = { P 3,x,l}i,j 
l<x<N 

= Pr[x t+ 2 =i,x t = j],l < i,j < N 
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Note that the counts of triples (the third moment) is actually a third-order tensor but 
for computational reasons we represent it by a collection of matrices indexed by the middle 
observation x where each matrix corresponds to a slice through the tensor. |Hsu et af. 2012 



showed that we can infer the HMM parameters from these marginal probabilities as follows. 
Let U be an N x K matrix of the top K left singular vectors (computed by the singular value 
decomposition) of P^,i- Intuitively, U acts as surrogate for the observation matrix O. 

We computed the following matrix C x for each observation x by describing the sample 
moments using the parameters, Ps, x ,l = OAO x Adiag(ir)0 T and i^i = OAAdiag(ir)0 T , 

C x := (U T P 3 , xA )(U T P 3A ) + = (^OA^i^OA)- 1 (1) 

where M + is the Moore-Penrose pseudoinverse of M. Since O x is a diagonal matrix, it is 
easily seen that U T OA represents the eigenvectors of the matrix and the diagonal elements 
of O x are exactly the eigenvalues. We will discuss how to compute the eigenvectors in our 



practical implementation in section 2.2.2 For now, suppose that the eigenvectors, U OA 
are given. Then for each observation x, 



(iFOA^Cxil^OA) 
= {U T OA)- 1 {U T OA)O x {U T OAY 1 (U T OA) = O x (2) 

Thus we can infer the emission matrix elements for x. The emission matrix O is inferred 
by combining all the O x 's. 

Note that we cannot infer the emission matrix elements directly from the eigenvalues computed 
from the above matrix C x because we do not know the order of the eigenvalues. We circumvent 
this problem by utilizing the fact that the eigenvectors are the same for all observations. 

Given the emission probabilities, the remaining parameters of the HMM, n and A, are 
easily computed by expressing the sample moments in terms of the parameters, Pi = Ott, 
P2,i = OAdiag(ir)0 T , and the assumption that A,0,diag(ir) are rank K (i.e. full-rank) as 
follows: 



vr = 0 + Pi (3) 
A = (0 + 0)A(diag(-K)diag(ir)- 1 ) 

= (0 + 0)Adiag(ir)(0 + 0) T diag{Tr)- 1 

= 0 + (OAdiag(TT)0 T )(0 + ) T diag(Tr)- 1 

= 0 + P 2A (0 + ) T diag(7r)- 1 (4) 



The original algorithm of |Hsu et a/. | [2012 assumed that we have access to many short 



samples from the HMM. When adapting the algorithm to the case of a few long samples, we 
found that the distribution of initial observations was quite different from the distribution of 
all observations. Estimating the state initial distribution tt from the the distribution of all 
observations Pi introduces a significant amount of noise. Therefore, we modified Pi to P| m * 
which is the distribution of the first segment of all the chromosomes and we slightly modified 
Equation [3] as 

7T = 0 + P{ nit (5) 
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More significantly, we empirically found that most entries of tt are close to zero and so 
taking the inverse of tt in the calculation of A would introduce noise. Thus, we modifed the 
computation of A to remove the dependence on tt as follows: 



A = A{Adiag{ir)0 T ){Adiag{Ti)0 T ) + 

= {(0 + 0)A(Adiag(Tr)0 T ))((0 + 0){Adiag(ir)0 T )) + 

= (0 + P 3 ,i)(0 + J P 2 ,i) + (6) 
2.2.2 Computing the eigenvectors using anchor observations 

It is not easy to compute the eigenvectors correctly because of sample noise. If we compute 
U T OA from the matrix for each observation, we will generally get different U T OA , s for 
different observations in practice. Thus, we used a method based on our empirical observations 
about the histone mark matrix. 

The observation space is N = 2 M where M is the number of histone marks. The maximum 
number of states is N. But, in fact, we observed that K is much lower than N, consistent with 
previous papers. Also, for each state, we observed there is a major observation that most of 
the segments in the state emit. This observation is related to a similar condition considered 



previously in the topic models setting (Arora et al. 2012). For each state i, we picked an 



"anchor observation" x' = argmax x {U^. f } which is similar to using a key word to represent a 



topic in the topic models setting (Anandkumar et al. 2012). The "anchor observation" tends 



to be the maximum value in the row of U, that is, it tends to appear in the state much more 
often than in the other states. Then we computed eigenvectors from C x > and extracted a single 
eigenvector corresponding to the largest eigenvalue, which in practice was separated well from 
the other eigenvalues. We did this for all states separately and combined the eigenvectors into 
a single matrix. 

Algorithm 1: Spectral learning algorithm 

Data: K: number of states, N: number of observations, T: number of segments 

Estimate P\, Pz,x, P$,i, and P^,x,i from the data. 

(P l}i= #(*p) 

ra i. . _ #(xt+i=i,x t =j) 
1-^2,1/1 j — t-1 

rp l _ #{xt+2=i,x t =j) 
V3,l/i,j — t-2 
/ p \ _ #(x t +2=i,xt+i=x,x t =j) 
\J^3,x,lfi,j — x-2 

Compute the SVD (singular value decomposition) of P^i and let U be the matrix of 

the left singular vectors corresponding to the K largest singular values. 

For each state, compute the matrix C x > defined in 0 for the anchor observation x' and 

compute the eigenvector v corresponding to the largest eigenvalue. 

Combine all the eigenvectors v's from the different states. 

Infer the emission matrix O using ([2]). 

Infer the initial state vector tt and the transition matrix A using @ and Q. 
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2.3 Handling technical issues 

Since the observation data is noisy, some numerical issues can occur when computing the SVD 



and eigendecomposition. Previously, (Cohen et al. 2013) described implementation issues 



using spectral learning for another latent variable model in natural language processing. We 
implemented several optimizations to solve the analogous issues for HMM. 

2.3.1 Smoothing of observation matrices 

The observation pairs are smoothed out as follows. When A is the smoothing factor between 
0 and 1, 



{Pair} it 



•3 



#(x t+ i =i,x t = j) 
#(x t+1 ,x t ) 



S P„,V\ - #( Xt + l = - #(zt+l =i,X t = j) 



#(xt+i,x t ) 1 £p. N #(x t+ i,x t ) 



{Pair }ij = {Pair}i y , ■ {Pair}j 
{Pair'} itj = \{Pair} itj + (1 - \){Pair 2 }^ 

Pair' is used as the smoothed observation pair data. 

The observation triples are smoothed out using an obvious generalization of the method 
for smoothing the pairs matrix (see Supplementary Material for details). After the data 
smoothing, the observation triple and pair matrices are normalized such that the sum of all 
elements is equal to 1. Intuitively, the smoothing method is similar to adding pseudocounts 
to sparse data matrices except it uses the marginal frequencies of the observations instead of 
a uniform pseudocount. 

The second and third moments, P2.l1 P3,l> P3,x,i are computed using these smoothed triples 
and pairs matrices. Specifically, P2,i = Pair' and {Pz,x,l\i,j = {Triple'}i, x j for 1 < i,x,j < 
N. We observed that A between 0.9 and 1 seemed to work well for different numbers of states 
and different cell types. We set A = 0.95 as a default but this parameter is adjustable by the 
user on the command line of our software. Note that with this setting of A the vast majority 
of the signal comes from the original data itself and not from the smoothing method. 

2.3.2 Handling negative probabilities 

Our estimated probabilities should be non-negative but the estimated parameters can have 
negative values because the signs can be flipped while performing the matrix computations. 
Therefore, we took the absolute value of the estimated parameters and normalized the pa- 



rameters following (Cohen et al. 2013) 



2.3.3 Parameter adjustment 

Although we estimated the HMM parameters using spectral learning ([!]), these estimates are 
affected by noise in the data and numerical precision errors. Therefore we took the estimates 
from spectral learning and performed a few iterations of local optimization of the likelihood 
using EM, following other researchers in the field (K. Chaudhuri, personal communication). 
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The number of iterations was set to five for the further analysis. We call this method Spec- 
tacle (SPECTral learning for Annotating Chromatin Labels and Epigenomics) . 



3 Results 



3.1 Data sets and experimental settings 

For the histone modification data, we downloaded data for eight histone marks (H3K4mel /2/3, 
H3K9ac, H3K27ac, H 3K27me3, H3K36me3 and H4K20mel ) for an ENCODE Tier 1 cell type, 
GM12878 (version 1) ( |ENCODE Project Consortium|2012l ) . For external biological validation 
of our predictions, we used transcription start site (TSS) data using Cap Analysis of Gene 



Expression (CAGE) from GENCODE ( |Harrow et aX|2012| version 10, May 2012), RNA Poly- 
merase2 (Pol2), P300 and CTCF ChlP-seq data, and polyA RNA-seq data from ENCODE, 
long intergenic non-coding RNA (lincRNA) RNA-seq data from ( Kelley and Rinn||2012 ), and 
conserved enhancer regions from the VISTA enhancer browser (Visel et al. 2007). All of the 



data sets except the enhancer data are specific to the cell type that we used for training the 
HMM. 

In order to test the performance of Spectacle over multiple cell types, we compared it 
to ChromHMM on data from nine other cell types (Hl-hESC, HeLa-S3, HepG2, HMEC, 
HSMM, HUVEC, K562, NHEK and NHLF). 

There are 15,181,508 segments when the human reference genome is divided into 200bp 



segments. The histone modification data was binarized by the program, Scripture (Guttman 



et q/.| ( |2010D ), which was originally developed for transcriptome reconstruction from RNA-seq 
reads and can also be used for ChlP-seq peak calling. From this, our final data set consisted 
of presence/absence calls for each histone mark for each segment. Thus the observation space 
consisted of all combinations of histone marks and in our case the number of observations 
was 2 8 = 256. We fixed the number of states to 15 unless stated otherwise, similar to the 
number of states used in previous studies ( Ernst et a/.|2011 ). This number of states is readily 
interpretable biologically and allows us to compare our annotations with previously published 
annotations. 



3.2 Spectral learning outperforms the EM algorithm 
3.2.1 Initialization of the EM algorithm 

The performance of the EM algorithm varied greatly depending on the parameter initializa- 
tion. For instance, when the number of states was 15, we ran the EM algorithm to convergence 
for 10 random parameter initializations. The log-likelihood varied from — 8.33-E6 to — 1.11E7 
(average: — 9.51S6, std. dev.: 7.67-E5) and the number of iterations needed for the likelihood 
to converge varied from 16 to 200. Even the highest log-likelihood was lower than that of the 
heuristic initialization method, information, which was — 7.81E6. These results demonstrate 
the importance of good parameter initializations when running the EM algorithm. 



3.2.2 Comparison of the likelihoods 

We compared Spectacle with ChromHMM in terms of the likelihood of the observed data 
given the parameters. For almost all numbers of states we tested, Spectacle had a higher 
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likelihood than ChromHMM. The exceptions were when the number of states was 25, 45 
and 50 (Table [T]). When the number of states was 100, the difference in the performance 
of the two methods was highly significant and the likelihood found by ChromHMM was 
lower than for smaller numbers of states, which suggests that the EM algorithm found a local 
optimum at this particular number of states. ChromHMM took 20-200 EM iterations to 
converge to a local optimum in the likelihood. When we ran ChromHMM for only five 
iterations to make its runtime comparable to Spectacle, we found that Spectacle had a 
higher likelihood than ChromHMM for all numbers of states that we tested. Together these 
results show that when the runtimes are similar, spectral learning outperforms EM, and even 
when spectral learning takes a much shorter runtime, it still generally performs better than 
EM as judged by the likelihood. 



3.2.3 Comparison of training times 



We found that Spectacle had a significantly faster training time than ChromHMM (Figure 
[TJ. Spectral learning takes much less compute time than even one iteration of EM and 
Spectacle only needed five iterations of local optimization of the likelihood. Over the range 
of states that we investigated, Spectacle was 3.4-66.8 times (in CPU time) and 4-40 times 
(in number of iterations) faster than ChromHMM. 

The training time of one iteration of EM is linear in the number of segments and the 
number of observations, and quadratic in the number of states. On the other hand, the 
training time of spectral learning is dominated by the computation time of the SVD which 
is cubic in the number of observations if all singular values are computed exactly (this is 
the running time of a practical implementation - theoretically the cubic dependence can be 



improved to the exponent of matrix multiplication (Demmel et al. 2007) which is currently 
< 2.373 (Williams 2012)). Thus after the initial creation of the sample moment matrices, the 



training time of spectral learning is independent of the genome size and can be much faster 
than the EM algorithm. 



3.2.4 Comparison of likelihoods over multiple cell types 

We compared both methods over multiple cell types to assess the performance of Spec- 
tacle over a range of data sets. We examined ten cell types including GM12878 and ran 
Spectacle and ChromHMM with 15 hidden chromatin states. Spectacle found higher 
likelihoods for 8 out of 10 cell types while also achieving a 2.6-12.6 fold (in CPU time) and 
2.6-13.4 fold (in number of iterations) faster training time (Table [2]). 



3.2.5 Empirical sample complexity of the algorithm 

Theoretically, the spectral learning algorithm for HMMs is known to have polynomial sam- 
ple complexity (Hsu et al. 2012). To complement these theoretical results, we empirically 
estimated the sample size required for training the parameters accurately. We trained the 
parameters of Spectacle for 15 states using the whole genome, half the genome, a quarter 
of the genome, and each chromosome individually using the cell type K562. We then tested 
the parameters by computing the log-likelihood of the whole genome of another cell type, 
GM12878, so as not to mix training and testing data sets. We found that the log-likelihood 
decreased as the sample size decreased as follows: — 7.62E6 for the whole genome; average: 
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— 7.61-E6 and std. dev.: 3.6-E4 for half the genome; average: — 7.67-E6 and std. dev.: 1.3-E5 
for a quarter of the genome; and average: — 8.13-E6 and std. dev.: 5.7E5 for individual chro- 
mosomes. This analysis suggests that sample sizes of roughly one chromosome are insufficient 
for the spectral learning method to accurately train the parameters. 

3.3 Comparison of the chromatin state annotations 

We estimated the HMM parameters using Spectacle for 15 chromatin states for one of the 
ENCODE cell lines, GM12878. Then we assigned the hidden states to segments using the 
posterior decoding algorithm, which assigns the most likely state to each segment. We used 
this method for both Spectacle and EM. We performed external validation of the predicted 
states using experimental data other than the histone modification data. We computed the 
enrichment of signals such as histone modification marks and independent biological datasets 
in the hidden HMM states (Figure S1|S2 ). The hidden states were manually annotated as 



TSS (transcription start site), enhancer, lincRNA, or exon (transcribed regions) according to 
the enrichment of signals using a similar procedure to previous works ( |Ernst and Kel lis 2010 



Hoffman et al. 2013) (see Supplementary Material for details). 

Two states for Spectacle and four states for ChromHMM were annotated as TSS. 
Spectacle predicted TSS or TSS plus flanking regions(±lkb) significantly more accurately 



than ChromHMM while using a smaller number of states (Figure 2|3 ) . TSS regions represent 



only 0.07% of the whole genome. About 93% of TSSs were found within the inferred TSS states 
in the HMM consisting of about 4% of the genome (~22 fold enrichment). Also, the flanking 
regions (±lkb) of TSS including promoter regions have similar histone mark patterns to TSS. 
About 89% of TSS flanking regions was found within the TSS states (~21-fold enrichment). 
Furthermore, the states have the highest enrichment of H3K4me3, H3K27ac, Pol2 and initial 
exon among the states, all of which are indicators of TSS and promoters (Figures S1|S2 ). 



Transcribed regions also have specific epigenetic patterns (Figure |4j). The TSS states and 
six other states for Spectacle and three out of the four TSS states and five other states for 
ChromHMM were annotated as Exon states (~3-fold enrichment), and they were enriched 



with H3K36me3, which is a signal for gene bodies (Bannister et al. 2005). 

Another important class of functional elements is lincRNAs (long intergenic non-coding 
RNAs) (Figure [5]). It is known that lincRNAs have the "K4-K36 signature" ( |Khalil et al. 



2009), that is, H3K4me3 enrichment at the transcription start site and H3K36me3 throughout 
the transcribed regions. Since this histone pattern is similar to protein-coding genes, lincRNAs 
are identified by searching for K4-K36 domains outside of known protein-coding genes. In fact, 
the TSS states and two of the Exon states for Spectacle and three out of the four TSS states 
and one Exon state for ChromHMM were annotated as lincRNAs (~4-fold enrichment). 

Lastly, four states for both Spectacle and ChromHMM were annotated as Enhancer 
(Figure[6j) (~2-fold enrichment). The Enhancer states did not overlap with TSS states and had 
a higher enrichment of H3K4mel/2 than that of H3K4me3. Two states among the Enhancer 
states had high enrichment of both P300 and CTCF. 

Overall, our results showed improvements in predicting all classes of functional elements 
that we tested, compared to a previous state-of-the-art method based on the EM algorithm. 
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4 Discussion 



We have presented an improved method and software tool called Spectacle (SPECTral learn- 
ing for Annotating Chromatin Labels and Epigenomics) for annotating chromatin states in the 
human genome. At the heart of our method is a state-of-the-art spectral learning method for 
unsupervised learning of the parameters of a Hidden Markov Model (HMM). To implement 
the spectral learning algorithm in practice, we made several technical modifications which 
improved its accuracy and numerical stability on the data sets we tested. These modifications 
may be of broader interest in other applications, including other computational biology prob- 
lems, computer vision and natural language processing. We showed that Spectacle outper- 
forms a previous state-of-the-art method, ChromHMM, by finding higher likelihoods in most 
data sets tested, having a much faster training time and more accurately annotating several 
independent biological datasets. Recently, the ChromHMM code was updated to add an op- 
tion for learning parameters in parallel using multiprocessors. Using the option would improve 



the runtime of both Spectacle and ChromHMM (http://compbio.mit.edu/ChromHMM/ 



Our software implementation is freely available online and is lightweight and easy to use on a 
regular desktop without the need for specialized computer hardware. Our code modifies the 
ChromHMM code which has been used by several groups so we believe it will be user-friendly 
and accessible. 

Spectral learning can be applied to multiple problems in computational biology. A recent 
work used spectral learning for poly(A) motif prediction (Xie et al. 2013). The authors 



did not try to recover the HMM parameters explicitly but instead learned them up to an 
unknown invertible linear transformation and used the transformed parameters as features for 
classification by a Support Vector Machine. Another recent paper ( Zou et aT|2013 ) applied a 
spectral learning algorithm for contrastive learning to a problem involving epigenome maps. 
This is a more restricted version of the problem we study here in which one is specifically 
contrasting two data sets instead of annotating a single data set. Since Hidden Markov 
Models and the Expectation-Maximization algorithm are used in many other problems in 
computational biology, it is possible that the methods described here may be useful in those 
settings. 

In the future, having a faster and more accurate chromatin state annotation tool should 
be useful for annotating multiple epigenomic maps. Previous works found associations be- 



tween disease-causing variants and epigenomic marks (Maurano et al. 2012 Kasowski et al. 



2013; Kilpinen et al. 2013; McVicker et al. 2013), suggesting that better understanding of 



the epigenome might help interpret the variants underlying human disease. For example, Ka- 
sowski et a/.||20i3| used ChromHMM to infer variation in chromatin states across individuals, 
so we expect Spectacle to be useful for other similar types of data sets in the future. Indeed 
there are currently several different major epigenomics projects (e.g., BLUEPRINT, Roadmap 
Epigenomics Project), producing chromatin mark data for many cell types and human pop- 
ulation (Adams et aZ.I 2012 Bernstein et al. 2010). Given the rapid decrease in the cost of 



sequencing, we also expect that many more epigenomic maps will be produced in different 



cell types, human populations ( Kasowski et aZ.||2013[) , species, environmental conditions and 



developmental contexts ( |Ernst et al.|2011[|Zhu et al.|2013[ ). Thus we expect that the need for 
fast and accurate tools for processing this type of data will continue to grow in the future. 
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6 Supplement 

Parameter initialization method: "information" 

For completeness we briefly describe the parameter initialization method described by Ernst 
and Kellis. We refer the reader to the original paper for a full description. The number of 
hidden states is denoted as K. The "information" method iteratively partitions the segments 
of the genome into K groups based on the presence of a selected chromatin mark. The group 
assignment for all segments is considered as a crude prediction of the hidden states, and 
the emission, transition and initial probability parameters are inferred accordingly from the 
hidden state prediction. 

The "information" method assigns groups to segments as follows. In the first step, all 
segments are assigned to group 1. Then for each step, the method creates a new group by 
partitioning an existing group into two groups. It repeats this partitioning step (K-l) times. 
To choose a group to partition, the method finds the best group and chromatin mark for 
splitting the group that maximizes an entropy measure. The segments in the group which 
have the mark constitute a new group and the segments in the group which do not have the 
mark remain in the group. 

Data smoothing for triples 

The observation triples (x t , xt+i, %t+2) are smoothed out as follows. 



{Triple}i )j)k 



{Triple}^. 
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Triple' is used as the smoothed observation triple data. 
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Enrichment of biological signals and state annotation 



The enrichment of a signal (either a chromatin modification or a biological annotation) for a 
state is calculated as follows: 



log 2 



Number of segments with the signal in the state 
Number of segments with the signal 
Number of segments in the state 
Total number of segments 



Our approach is semi-automated in the sense that we assign biological interpretations 



to the hidden states manually, similar to previous works (Ernst and Kellis (2010); Hoffman 



et al. (2013)). We based our manual annotations on experimental evidence from (Barski et al 



fl2007D ; |Heintzman et al\\2Qm\ ). States with positive enrichment of Pol2, H3K27ac, H3K4me3 
and H3K3mel, as well as greater enrichment of H3K4me3 compared to H3K4mel were an- 
notated as "TSS". States with positive enrichment of Enhancer, H3K4me3 and H3K3mel, as 
well as greater enrichment of H3K4mel compared to H3K4me3 were annotated as "Enhancer" . 
States with positive enrichment of lincRNA and positive enrichment of either H3K4me3 or 
H3K36me3 but not annotated as "Enhancer" were annotated as "lincRNA" . States with pos- 
itive enrichment of both Exon and H3K36me3 were annotated as "Exon" . 

Instead of the above manual assignments, we also ranked states by enrichment for each 
biological dataset and plotted the Precision-Recall curves. The results were consistent in that 
Spectacle always performed the same or better than ChromHMM. 



7 Tables and Figures 
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Table 1: The log- likelihood of Spectacle was higher than that of ChromHMM as the 
number of states was varied. Within the parentheses is the log-likelihood ratio between 
ChromHMM with only five EM iterations and Spectacle. 
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ratio (xlO 4 ) 


GM12878 


-7.81e+06 
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13.3 



Table 2: The log-likelihood of Spectacle is higher than that of ChromHMM for ten cell 
types. Within the parentheses is the log-likelihood ratio between ChromHMM with only 
five iterations of EM and Spectacle. 
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Figure 1: The training time of Spectacle vs. ChromHMM. 
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Figure 2: Precision- Recall curve for prediction of TSS when the number of states is 15. 
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Figure 3: Precision-Recall curve for prediction of TSS plus flanking regions (within ±lkb) 
when the number of states is 15. 
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Figure 4: Precision-Recall curve for prediction of the transcribed exons when the number of 
states is 15. 
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Figure 5: Precision- Recall curve for prediction of lincRNAs when the number of states is 15. 
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Figure 6: Precision-Recall curve for prediction of enhancers when the number of states is 15. 
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Figure SI: The enrichment of signals for each state when the number of states is 15 for 
Spectacle. TSS states are 7 and 8, enhancer states are 4, 10, 12 and 15, lincRNA states are 
3, 5, 7 and 8, and transcribed exon states are 2, 3, 4, 5, 7, 8, 12 and 13. 
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Figure S2: The enrichment of signals for each state when the number of states is 15 for 
ChromHMM. TSS states are 4, 7, 12 and 15, enhancer states are 1, 6, 9 and 14, lincRNA 
states are 7, 8, 12 and 15, and transcribed exon states are 1, 4, 7, 8, 9, 10, 11 and 12. 
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